Let's import numpy and pandas and load up some data to work with.
import numpy as np
import pandas as pd
# load data
thefts_joined = pd.read_csv('/content/data/bike_thefts_joined.csv',
dtype={'n_id': str})
neighbourhoods = pd.read_csv('/content/data/neighbourhoods.csv',
dtype={'n_id': str})
# fix dates
thefts_joined['occurrence_date'] = pd.to_datetime(thefts_joined['occurrence_date'])
thefts_joined['report_date'] = pd.to_datetime(thefts_joined['report_date'])
thefts_joined.head()
# exclude the City of Toronto
neighbourhoods = neighbourhoods.loc[neighbourhoods['neighbourhood'] != 'City of Toronto']
neighbourhoods.head()
# add new columns showing % of commuters for each mode
def calc_pct(mode):
    return round(mode/neighbourhoods['total_commuters'], 3)
# new column names
pct_cols = ['pct_drive', 'pct_cp', 'pct_transit', 'pct_walk']
neighbourhoods[pct_cols] = neighbourhoods.loc[:, 'drive':'walk'].apply(calc_pct)
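To see what apply() does in the cell above, here is a minimal sketch on a made-up two-row table. The values and the name toy are invented for illustration and are not taken from the Toronto data. apply() hands each selected column to the function in turn, and the division lines values up by row index.

```python
import pandas as pd

# toy data: invented values, for illustration only
toy = pd.DataFrame({'total_commuters': [100, 200],
                    'drive': [50, 120],
                    'walk': [25, 30]})

def calc_pct(mode):
    # mode is a single column (a Series); division aligns on the row index
    return round(mode / toy['total_commuters'], 3)

# each column from 'drive' through 'walk' is passed to calc_pct in turn
toy[['pct_drive', 'pct_walk']] = toy.loc[:, 'drive':'walk'].apply(calc_pct)
print(toy)
```

The new column names are matched to the result's columns by position, so the order of the names has to follow the order of the sliced columns.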
So far, we have gotten data, wrangled it, and scratched the surface of exploratory analyses. As part of that exploration, we created charts with pandas. However, there are dedicated visualization libraries that let us customize our charts further.
matplotlib
matplotlib is the foundational data visualization library in Python. pandas's visualization functions are, at their core, matplotlib functions. Other popular libraries like seaborn similarly build on matplotlib.
For historical reasons, when we import matplotlib, we really import matplotlib.pyplot. The conventional alias is plt.
# jupyter-specific "magic" command to render plots in-line
%matplotlib inline
import matplotlib.pyplot as plt
matplotlib visuals consist of one or more Axes in a Figure. An Axes, confusingly, is what we would consider a graph, while the Figure is a container for those graphs. An Axes has an x-Axis and a y-Axis.
More details can be found at: https://matplotlib.org/stable/tutorials/introductory/quick_start.html
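The relationship between a Figure, its Axes, and each Axes's two Axis objects can be inspected directly. This is a small sketch rather than part of the original walkthrough: it uses plt.subplots(), which is introduced later in this section, and selects the non-interactive Agg backend only so that it runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

anatomy_fig, anatomy_ax = plt.subplots()

# the Figure keeps a list of the Axes it contains
print(anatomy_fig.axes == [anatomy_ax])

# each Axes knows which Figure it belongs to
print(anatomy_ax.figure is anatomy_fig)

# an Axes owns one x-Axis object and one y-Axis object
print(type(anatomy_ax.xaxis).__name__, type(anatomy_ax.yaxis).__name__)
```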
matplotlib
matplotlib provides two ways to create visualizations:
- the pyplot approach, which has pyplot automatically create and manage Figures and Axes, keeping track of which Figure and Axes we are currently working on
- the object-oriented approach, in which we create Figures and Axes explicitly and call methods on them
The object-oriented approach is recommended, but the pyplot approach is convenient for quick plots.
pyplot-style plotting
pyplot-style plotting is convenient for quick, exploratory plots, where we don't plan on doing a lot of customization. When we plotted data in pandas, pandas took this approach. Let's plot the neighbourhood data with the pyplot approach. plot() produces a line plot by default.
plt.plot(neighbourhoods['pop_2016'],
neighbourhoods['bike'])
Let's make it a scatterplot instead with the scatter() function.
We can use keyword arguments like facecolor and edgecolor to change the styling. matplotlib lets us specify colour with RGB(A) tuples, hexadecimal strings, single-character shortcodes, and even xkcd colours.
plt.scatter(neighbourhoods['pop_2016'],
neighbourhoods['bike'],
marker='s', # square marker
facecolor='#fb1',
edgecolor='k') # black
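The different colour formats all resolve to the same kind of RGBA value internally. As a quick check, the sketch below uses matplotlib.colors.to_rgba() and same_color() to convert a few specifications; the specific colours chosen are arbitrary examples.

```python
import matplotlib.colors as mcolors

# single-character shortcode for black
print(mcolors.to_rgba('k'))

# three-digit hex string, shorthand for '#ffbb11'
print(mcolors.to_rgba('#fb1'))

# an RGBA tuple passes through as-is
print(mcolors.to_rgba((1.0, 0.0, 0.0, 0.5)))

# xkcd colour names take an 'xkcd:' prefix
print(mcolors.to_rgba('xkcd:sky blue'))

# same_color() compares two specifications after conversion
shorthand_matches = mcolors.same_color('#fb1', '#ffbb11')
print(shorthand_matches)
```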
Using the pyplot approach, the outputs of successive function calls in the same cell context are layered on. Let's layer driving and biking commuter counts and add a legend.
plt.scatter(neighbourhoods['pop_2016'],
neighbourhoods['drive'],
edgecolor='k',
label='Driving')
plt.scatter(neighbourhoods['pop_2016'],
neighbourhoods['bike'],
edgecolor='w',
label='Cycling')
plt.legend()
Calls in a different cell start a new Figure and Axes, so the grid() call below draws on an otherwise empty plot.
plt.grid()
The object-oriented approach is the preferred method of plotting with matplotlib. In this approach, we use the subplots() function to create plot objects, then call methods to modify them.
By default, subplots() returns one Figure and one Axes. We can use Python's unpacking syntax to assign the Figure and Axes to their own variables in one line.
fig, ax = plt.subplots()
print(f'{type(fig)}, {type(ax)}')
The Axes is empty. Let's plot data on it with the Axes scatter() method. This method updates ax with a scatterplot. To make it easier to refer to each scatterplot later, we assign the outputs to their own variables: drivers, cyclists, and total.
drivers = ax.scatter(neighbourhoods['pop_2016'],
neighbourhoods['drive'])
cyclists = ax.scatter(neighbourhoods['pop_2016'],
neighbourhoods['bike'])
total = ax.scatter(neighbourhoods['pop_2016'],
neighbourhoods['total_commuters'])
fig
This graph doesn't give much context. To add a title, we can use the Axes set_title() method, which takes the title as a string, plus optional arguments like fontsize. Similarly, we can set x and y labels with the set_xlabel() and set_ylabel() methods. Finally, let's add a grid with the Axes grid() method, and use the alpha parameter to make it translucent. We'll also use the set_axisbelow() method to make sure markers draw over the grid.
ax.set_title('Neighbourhood Population vs Commuter Population')
ax.set_xlabel('Population, 2016')
ax.set_ylabel('Commuters')
ax.set_axisbelow(True)
ax.grid(alpha=0.3)
fig
This graph could use a legend. To add one, we call the Axes legend() method. If we passed a label argument in the scatter() calls, legend() would use those labels. However, because we did not, we pass a list of the geometries to use in the legend, plus a list of labels to show.
ax.legend([drivers, cyclists, total],
['Drivers', 'Cyclists', 'Total Commuters'])
fig
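For comparison, here is a sketch of the alternative mentioned above, where each scatter() call receives a label and legend() is called with no arguments. The points are placeholders, not the neighbourhood data, and the Agg backend line is only there so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

demo_fig, demo_ax = plt.subplots()

# placeholder points; label= is what legend() picks up
demo_ax.scatter([1, 2, 3], [4, 5, 6], label='Drivers')
demo_ax.scatter([1, 2, 3], [1, 2, 3], label='Cyclists')

# no handles or labels passed: legend() collects the labelled artists itself
demo_legend = demo_ax.legend()
legend_labels = [text.get_text() for text in demo_legend.get_texts()]
print(legend_labels)
```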
To place the legend outside the Axes, we can pass a tuple with the bbox_to_anchor argument. The legend's loc corner will be placed at the coordinates in the bbox_to_anchor tuple.
ax.legend([drivers, cyclists, total],
['Drivers', 'Cyclists', 'Total Commuters'],
bbox_to_anchor=(1, 1),
loc='upper left')
fig
We can change how the x-axis and y-axis are formatted by accessing an Axes's xaxis and yaxis attributes and calling methods like set_ticks() or set_major_formatter().
matplotlib 3.3 and later allow us to pass a format string by itself to set_major_formatter(). Older versions require that we import matplotlib's ticker submodule and create a StrMethodFormatter with the format string we want to use.
import matplotlib.ticker as tick
# label with a thousands place comma and zero decimal places
ax.xaxis.set_major_formatter(tick.StrMethodFormatter('{x:,.0f}'))
ax.yaxis.set_major_formatter(tick.StrMethodFormatter('{x:,.0f}'))
fig
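A formatter can also be called on its own, which is a handy way to preview what a format string will produce before attaching it to an axis. The sketch below tries the two format strings used in this section on sample values; the values themselves are arbitrary.

```python
import matplotlib.ticker as tick

# thousands separator, zero decimal places
thousands = tick.StrMethodFormatter('{x:,.0f}')
# percentage, zero decimal places
percent = tick.StrMethodFormatter('{x:.0%}')

# calling a formatter with a tick value returns the label text
thousands_label = thousands(12345.6)
percent_label = percent(0.25)
print(thousands_label)  # 12,346
print(percent_label)    # 25%
```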
We can also change axis limits.
#ax.xaxis.set_ticks(np.arange(0, max(neighbourhoods['pop_2016']+10), 10000))
ax.axis()
ax.set(ylim=(0, ax.axis()[1])) # make the y-axis match the x-axis
fig
matplotlib comes with a bunch of predefined styles. We can view the available ones with plt.style.available. Passing one of the options to style.use() makes it the aesthetic style for all new plots. Already created Figures and Axes are not affected.
plt.style.available[5:10] # print a subset
# set style for new plots
plt.style.use('fivethirtyeight')
# notice that the style of fig did not change
fig
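If we only want a style for one plot, matplotlib also offers plt.style.context(), which is not used elsewhere in this walkthrough. As a hedged sketch: the style applies inside the with block and the previous settings come back afterwards. The ggplot style and the plotted numbers are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

grid_before = plt.rcParams['axes.grid']

with plt.style.context('ggplot'):
    # the ggplot style switches the grid on for plots created in this block
    grid_inside = plt.rcParams['axes.grid']
    tmp_fig, tmp_ax = plt.subplots()
    tmp_ax.plot([0, 1, 2], [0, 1, 4])

# leaving the block restores the earlier settings
grid_after = plt.rcParams['axes.grid']
print(grid_before, grid_inside, grid_after)
```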
Of course, matplotlib offers more than just line plots and scatterplots. Among the many kinds of plots we can make are bar plots, histograms, and boxplots. To create each the object-oriented way, we call the appropriate Axes method, like Axes.boxplot() or Axes.barh(), for a horizontal bar plot.
# review the neighbourhoods data
neighbourhoods.head()
# get just the 10 biggest neighbourhoods to plot
top10_pop = neighbourhoods.sort_values('pop_2016', ascending=False).head(10)
top10_pop
bar_fig, bar_ax = plt.subplots()
bar_ax.barh(top10_pop['neighbourhood'], top10_pop['pop_2016'])
bar_ax.xaxis.set_major_formatter(tick.StrMethodFormatter('{x:,.0f}'))
bar_ax.set_axisbelow(True)
bar_ax.grid(alpha=0.3)
bar_ax.set_title('Most Populous Toronto Neighbourhoods')
bar_ax.set_xlabel('Population, 2016')
# create a box plot
box_fig, box_ax = plt.subplots()
box_ax.boxplot([neighbourhoods['pct_transit'],
neighbourhoods['pct_walk'],
neighbourhoods['pct_drive']],
# add labels so we know which box is which var
labels=['% Transit', '% Walk', '% Drive'])
box_ax.yaxis.set_major_formatter(tick.StrMethodFormatter('{x:.0%}'))
box_ax.set_title('Neighbourhood Commuter Modes')
# create a histogram
hist_fig, hist_ax = plt.subplots()
hist_ax.hist(neighbourhoods['transit'],
# count the neighbourhoods with 0-1000 transit commuters,
# 1001-2000 transit commuters, etc
bins=range(0, 12000, 1000))
hist_ax.xaxis.set_major_formatter(tick.StrMethodFormatter('{x:,.0f}'))
hist_ax.set_title('Transit Commuter Distribution')
hist_ax.set_xlabel('# of Transit Commuters')
hist_ax.set_ylabel('# of Neighbourhoods')
We've seen that a single Axes can have more than one set of data points plotted on it with our multi-modal scatterplot. We can similarly layer on other graphics, using the alpha argument to set transparency.
layer_fig, layer_ax = plt.subplots()
settings = {'alpha': 0.4, 'bins': np.arange(0, 1, .05)}
layer_ax.hist(neighbourhoods['pct_drive'], label='Drive', **settings)
layer_ax.hist(neighbourhoods['pct_transit'], label='Transit', **settings)
layer_ax.xaxis.set_major_formatter(tick.StrMethodFormatter('{x:.0%}'))
layer_ax.set_axisbelow(True)
layer_ax.grid(alpha=0.2, linestyle='--', axis='y')
layer_ax.set_title('Commute Mode Distribution')
layer_ax.legend()
layer_ax
Let's try plotting the number of reported bike thefts each year by whether the bike was recovered or not. We'll need to wrangle the theft data a bit to get counts by year and status. Then, we'll use the data to make a stackplot(). Finally, we'll style it.
# review the available columns
thefts_joined.columns
thefts_grouped = (thefts_joined
.groupby(['occurrence_year', 'status'])
.agg(thefts=('_id', 'count'))
.reset_index() # make occurrence year a regular col
.pivot(index='occurrence_year', columns='status', values='thefts')
.reset_index() # ...and again
.fillna(0))
thefts_grouped
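The groupby, pivot, and fillna chain is easier to follow on a tiny table. The sketch below runs the same steps on five invented theft records (the toy_thefts data is made up for illustration): counts per year and status are computed in long form, pivoted so each status becomes a column, and missing combinations are filled with 0.

```python
import pandas as pd

# invented records, for illustration only
toy_thefts = pd.DataFrame({
    '_id': [1, 2, 3, 4, 5],
    'occurrence_year': [2019, 2019, 2019, 2020, 2020],
    'status': ['STOLEN', 'STOLEN', 'RECOVERED', 'STOLEN', 'STOLEN'],
})

toy_grouped = (toy_thefts
               .groupby(['occurrence_year', 'status'])
               .agg(thefts=('_id', 'count'))   # one row per year/status pair
               .reset_index()
               .pivot(index='occurrence_year', columns='status', values='thefts')
               .reset_index()
               .fillna(0))                     # 2020 had no RECOVERED rows
print(toy_grouped)
```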
stfig, stax = plt.subplots()
stax.stackplot(thefts_grouped['occurrence_year'], thefts_grouped['STOLEN'],
thefts_grouped['RECOVERED'], thefts_grouped['UNKNOWN'],
labels=['Stolen', 'Recovered', 'Unknown'])
stax.set_axisbelow(True)
stax.grid(alpha=0.3)
stax.legend(loc='upper left')
stax.set_title('Reported Bike Thefts by Recovery Status')
stax.set_ylabel('Reported Thefts')
stax.set_xlabel('Year')
We can create multiple Axes in one Figure by passing nrows and ncols arguments to subplots(). The number of Axes we get equals nrows * ncols. Multiple Axes are returned as a numpy array.
modefig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True)
ax1.scatter(neighbourhoods['pop_2016'],
neighbourhoods['drive'])
ax2.scatter(neighbourhoods['pop_2016'],
neighbourhoods['bike'])
ax1.set_title('Drivers')
ax2.set_title('Cyclists')
ax1.yaxis.set_major_formatter(tick.StrMethodFormatter('{x:,.0f}'))
ax1.xaxis.set_major_formatter(tick.StrMethodFormatter('{x:,.0f}'))
ax2.xaxis.set_major_formatter(tick.StrMethodFormatter('{x:,.0f}'))
modefig.suptitle('Commuters by Mode and Neighbourhood Population')
modefig.tight_layout()
As the number of subplots grows, it gets cumbersome to unpack them in the assignment statement. We can temporarily assign all of them to a single variable.
# make a 2x2 grid of subplots
modefig2, mode_ax = plt.subplots(nrows=2, ncols=2, sharey=True, sharex=True)
mode_ax
The Axes are arranged in a 2x2 array. It would be more straightforward to refer to them if we had a 1x4 array instead.
# accessing items in a 2x2 array can be annoying
mode_ax
# example: getting the bottom left Axes
mode_ax[1, 0]
We can take advantage of numpy arrays' flatten() method. Recall that flatten() returns a new array with all the elements arranged in a single row. We can then unpack the elements of that row and assign them to individual variables.
# recall what flatten() does
mode_ax.flatten()
a1, a2, a3, a4 = mode_ax.flatten()
modefig2 # we haven't changed the Figure
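As a small check of how the flattened order relates to grid positions, the sketch below builds a separate 2x2 grid (named grid_axes here so it does not touch mode_ax) and confirms that flatten() walks the grid row by row while still pointing at the same Axes objects.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

grid_fig, grid_axes = plt.subplots(nrows=2, ncols=2)
flat_axes = grid_axes.flatten()

# a 2x2 array becomes a one-dimensional array of four Axes
print(grid_axes.shape, flat_axes.shape)

# row-major order: the bottom-left Axes, grid_axes[1, 0], is third in the flat array
print(flat_axes[2] is grid_axes[1, 0])
```

flatten() copies the array, not the Axes, so plotting through the unpacked variables still updates the original Figure.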
Plotting commute mode against total population four times will be tedious. To reuse code, we can write a helper function that takes an Axes, the mode we're plotting, and a dictionary of style parameters and updates the Axes. Inside the function, the parameters in param_dict are merged over a dictionary of defaults with update(), and **defaults unpacks the merged dictionary into keyword arguments for scatter().
def plot_modes(ax, mode, param_dict):
    '''
    Helper function to plot neighbourhood pop
    against commuting mode.
    '''
    defaults = {'alpha': 0.45, 's': 10}
    defaults.update(param_dict)
    out = ax.scatter(neighbourhoods['pop_2016'],
                     neighbourhoods[mode],
                     **defaults)
    return out
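The merge-then-unpack pattern in the helper does not depend on matplotlib. Here is a plain-Python sketch, with a hypothetical describe() function standing in for the plotting call, that shows which values win and what the callee receives.

```python
def describe(**kwargs):
    # hypothetical stand-in for a plotting call: report the keyword arguments received
    return kwargs

defaults = {'alpha': 0.45, 's': 10}
overrides = {'s': 25, 'label': 'drive'}

merged = dict(defaults)    # copy, so the defaults dictionary stays untouched
merged.update(overrides)   # values from overrides replace matching defaults

# ** turns each key/value pair into a keyword argument
received = describe(**merged)
print(received)
```

Keys present in both dictionaries take the caller's value, and keys the caller omits keep their defaults.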
Then, we can call plot_modes to plot each of the subplots.
# add data to each axes
plot_modes(a1, 'drive', {'label': 'drive', 'facecolor': 'k'})
plot_modes(a2, 'transit', {'label': 'transit', 'facecolor': 'b'})
plot_modes(a3, 'walk', {'label': 'walk', 'facecolor': 'm'})
plot_modes(a4, 'bike', {'label': 'bike', 'facecolor': 'g'})
modefig2.legend(bbox_to_anchor=(1, 1), loc='upper left')
modefig2.tight_layout()
modefig2.suptitle('Commuter Modes')
modefig2
Successive method calls on an Axes object layer on graphics. To clear everything from an Axes, we can use its clear() method. To clear every subplot in a Figure, we can loop through the flattened array of Axes and clear() each Axes in turn.
for axes in mode_ax.flatten():
    axes.clear()
modefig2
# let's reset our style before moving on
plt.style.use('default')
seaborn
seaborn builds upon and complements matplotlib, producing nicer-looking Axes with less code, and giving us a few more convenient plot types. seaborn is typically given the alias sns, a pop culture reference to the initials of Samuel Norman Seaborn, a character on The West Wing.
import seaborn as sns
With seaborn, we have two ways of structuring arguments to plotting functions:
- pass the x and y axis columns directly
- pass the data we are visualizing, then the names of the x and y axis columns
# use x and y axis columns
sns.scatterplot(x=neighbourhoods['pop_dens'],
y=neighbourhoods['pct_transit'])
# use the dataframe and column names
sns.scatterplot(data=neighbourhoods,
x='pop_dens',
y='pct_transit')
For comparison, we can create the same plot using matplotlib's pyplot approach.
plt.scatter(neighbourhoods['pop_dens'],
neighbourhoods['pct_transit'])
seaborn and object-oriented matplotlib
We can use seaborn as a complement to matplotlib's object-oriented approach. seaborn functions that draw a single plot have an optional ax keyword argument that lets us pass in an existing Axes to update. As a bonus, they return the Axes we're working with, making it easy to chain methods together.
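A minimal sketch of that behaviour, using placeholder points rather than the neighbourhood data: the object returned by sns.scatterplot() is the very Axes passed in through ax, so matplotlib methods can be called on either name. The Agg backend line is only there so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import seaborn as sns

sketch_fig, sketch_ax = plt.subplots()

# placeholder points; ax= tells seaborn which Axes to draw on
returned = sns.scatterplot(x=[1, 2, 3], y=[3, 1, 2], ax=sketch_ax)

# the return value is the same Axes, so Axes methods are available on it
returned.set_title('Placeholder data')
print(returned is sketch_ax)
```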
Let's revisit our 10 biggest Toronto neighbourhoods chart.
bar_fig
This was the code to create that plot. We'll recreate it with seaborn.
bar_fig, bar_ax = plt.subplots()
bar_ax.barh(top10_pop['neighbourhood'], top10_pop['pop_2016'])
bar_ax.xaxis.set_major_formatter(tick.StrMethodFormatter('{x:,.0f}'))
bar_ax.set_axisbelow(True)
bar_ax.grid(alpha=0.3)
bar_ax.set_title('Most Populous Toronto Neighbourhoods')
bar_ax.set_xlabel('Population, 2016')
And with seaborn:
sns.set_theme() # use seaborn's default style settings going forward
sns_fig, sns_ax = plt.subplots() # create a Figure and Axes
(sns.barplot(data=top10_pop, # set datasource
x='pop_2016', # for a horizontal bar graph
y='neighbourhood',
ax=sns_ax) # plot on an existing Axes
.set(xlabel='Population, 2016',
ylabel='Neighbourhood'))
# .set() returns a list rather than the Axes, so we can't chain .set_title()
sns_ax.set_title('Most Populous Toronto Neighbourhoods',
fontdict={'fontsize': 18})
sns_ax.xaxis.set_major_formatter(tick.StrMethodFormatter('{x:,.0f}'))
With matplotlib, we created individual subplots and updated them with a helper function to visualize data for different categories. With seaborn, we can create a FacetGrid and then use its map() method to visualize data by category. map() takes the name of the plotting function to use, then the needed arguments, such as the columns to use for the x-axis and y-axis.
# reshape neighbourhood data to support faceting
neighbourhoods_reshaped = (neighbourhoods[['neighbourhood',
'pct_transit',
'pct_drive',
'pct_walk',
'pct_bike']]
.melt(id_vars='neighbourhood'))
neighbourhoods_reshaped.head()
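To make the reshaping step concrete, here is a sketch of melt() on a made-up two-neighbourhood table (names and percentages are invented). Each non-id column becomes a block of rows, with the old column name stored in variable and the cell contents in value, which is the layout FacetGrid needs to split plots by category.

```python
import pandas as pd

# invented values, for illustration only
wide = pd.DataFrame({'neighbourhood': ['A', 'B'],
                     'pct_transit': [0.40, 0.25],
                     'pct_walk': [0.10, 0.05]})

# one row per neighbourhood/mode pair instead of one column per mode
long = wide.melt(id_vars='neighbourhood')
print(long)
```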
# specify the data to use and the column to facet by
# we'll give each variable its own row
facets = sns.FacetGrid(data=neighbourhoods_reshaped,
row='variable')
# create a histogram for each mode
facets.map(sns.histplot, 'value', binwidth=0.05)
facets.set_axis_labels('%', '# of Neighbourhoods')
For another example, we can plot reported bike thefts by year, faceted by status.
# reshape the theft counts to support faceting
theft_counts_long = thefts_grouped.melt(id_vars='occurrence_year',
value_name='Count')
# specify the data to use and the column to facet by
# we'll give each status its own row
facets = sns.FacetGrid(data=theft_counts_long, row='status')
# for each status, create a lineplot of counts by year
facets.map(sns.lineplot, 'occurrence_year', 'Count')
facets.set_axis_labels('Occurrence Year')
seaborn's pair plots are particularly useful for exploratory analyses. pairplot() takes a DataFrame and creates a grid of pairwise scatterplots of its numeric columns, with each variable's own distribution on the diagonal, allowing us to visually look for relationships between variables.
# review the columns available
neighbourhoods.columns
# review just the numeric columns
neighbourhoods.select_dtypes('number').columns
# select some columns to use in the pair plot
cols = ['pop_2016', 'total_commuters', 'pct_drive', 'pct_transit', 'pct_walk', 'pct_bike']
simple_pairs = sns.pairplot(neighbourhoods[cols])
# if we include non-numeric variables, they won't be plotted, but we can use them for hue
cols = ['pop_2016', 'designation', 'total_commuters', 'pct_drive', 'pct_transit', 'pct_walk', 'pct_bike']
pairwise_fig = sns.pairplot(neighbourhoods[cols], hue='designation')
We can combine seaborn's heatmap() function with the pandas DataFrame corr() method to explore correlations in our data.
# calculate correlations with pandas
correlations = neighbourhoods.loc[:, 'pct_bike':].corr('kendall')
# create a figure and axes
corr_fig, corr_ax = plt.subplots()
corr_fig.set_size_inches(5, 4)
sns.heatmap(correlations, ax=corr_ax, annot=True)
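The table that heatmap() colours in is an ordinary DataFrame. As a sketch on invented numbers, using corr()'s default Pearson method rather than the Kendall method chosen above, the result is a square, symmetric table with 1 on the diagonal.

```python
import pandas as pd

# invented values: b rises with a, c falls as a rises
toy_modes = pd.DataFrame({'a': [1, 2, 3, 4],
                          'b': [2, 4, 6, 8],
                          'c': [4, 3, 2, 1]})

toy_corr = toy_modes.corr()   # default method is Pearson
print(toy_corr)
```

Rows and columns are both labelled with the original column names, which is why heatmap() can label both axes without extra arguments.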
To save a plot, use the Figure savefig() method, which supports exporting figures in common formats like PNG, PDF, and SVG. Setting bbox_inches='tight' makes matplotlib work out the extent of the plot and crop the image to fit. seaborn has no standalone plot-saving function of its own; grid objects such as the PairGrid returned by pairplot() offer a savefig() method that passes its arguments on to the underlying Figure.
pairwise_fig.savefig('pairs.svg', bbox_inches='tight')
corr_fig.savefig('correlations.png', bbox_inches='tight')
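savefig() also accepts a file-like object, which is useful when we want the image bytes without writing to disk. The sketch below is an illustration on a placeholder plot: it saves a PNG into an in-memory buffer, sets the resolution with dpi, and names the format explicitly because there is no file extension to infer it from.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

save_fig, save_ax = plt.subplots()
save_ax.plot([0, 1, 2], [0, 1, 4])   # placeholder data

buffer = io.BytesIO()
# format must be given explicitly when there is no filename
save_fig.savefig(buffer, format='png', dpi=150, bbox_inches='tight')

png_bytes = buffer.getvalue()
print(png_bytes[:8])   # PNG files begin with a fixed 8-byte signature
```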
plotly
plotly gives us a way to create interactive graphics within Python, building on the plotly.js library rather than matplotlib. Plotly Express provides an entry point to making data visualizations with the package. Let's re-create the drivers vs cyclists scatterplot to start.
import plotly.express as px
plotly_fig = px.scatter(neighbourhoods,
x='drive',
y='bike',
title='Commute Modes')
plotly_fig.show(renderer='notebook') # ensure plot renders nicely in notebook mode
# add hover data
plotly_fig = px.scatter(neighbourhoods,
x='drive',
y='bike',
hover_name='neighbourhood', # show neighbourhood on hover
labels={'bike': 'Bike', 'drive':'Drive'},
title='Commute Modes')
plotly_fig.show(renderer='notebook') # ensure plot renders nicely in notebook mode
print(top10_pop.columns)
hist_fig = px.bar(top10_pop,
x=['pct_drive', 'pct_cp', 'pct_transit', 'pct_walk', 'pct_bike'],
y='neighbourhood',
hover_name='neighbourhood',
hover_data=['drive', 'car_passenger', 'transit', 'walk', 'bike'],
labels={'variable': 'Mode',
'value': '%'}
)
hist_fig.show(renderer='notebook')
# view available themes
import plotly.io as pio
pio.templates
bar_fig = px.bar(top10_pop,
x='pop_2016',
y='neighbourhood',
text='pop_2016',
labels={'pop_2016': 'Population, 2016',
'neighbourhood': 'Neighbourhood'},
hover_data={'neighbourhood': False,
'pop_2016':False,
'pop_change': ':.2p'}, # add pop change, formatted as %
title='Top Toronto Neighbourhoods by Population',
template='seaborn'
)
bar_fig.show(renderer='notebook')
plotly graphs
For added control over visualizations, we can import plotly's graph_objects submodule.
import plotly.graph_objects as go
transit_hist = go.Histogram(x=neighbourhoods['pct_transit'], name='Transit')
drive_hist = go.Histogram(x=neighbourhoods['pct_drive'], name='Drive')
data = [drive_hist, transit_hist]
layout = go.Layout(template='seaborn',
title='Commute Mode Distribution',
xaxis={'title': 'Mode %'},
yaxis={'title': 'Neighbourhoods'}
)
fig = go.Figure(data=data, layout=layout)
fig.update_layout(hovermode='x')
fig.show(renderer='notebook')
plotly visualizations
We can save visualizations created in plotly to an image or PDF file with the write_image() Figure method. Note that write_image() needs the kaleido package to work.
!pip install -U kaleido
import kaleido
fig.write_image('fig.pdf', format='pdf')